Abstract The Ord’s graph is a simple graphical method for displaying frequency
distributions of data or theoretical distributions in the two-dimensional
plane. Its coordinates are proportions of the first three moments, either
empirical or theoretical ones. A modification of the Ord’s graph based on
proportions of indices of qualitative variation is presented. Such a modification
makes the graph applicable also to data of categorical character. In addition,
the indices are normalized with values between 0 and 1, which enables comparing
data files divided into different numbers of categories. Both the original and
the new graph are used to display grapheme frequencies in eleven Slavic
languages. As the original Ord’s graph requires an assignment of numbers to the
categories, graphemes were ordered decreasingly according to their frequencies.
Data were taken from parallel corpora, i.e., we work with grapheme frequencies
from a Russian novel and its translations into ten other Slavic languages.
Then, cluster analysis is applied to the graph coordinates. While the original
graph yields results which are not linguistically interpretable, the modification
reveals meaningful relations among the languages.
1 Introduction and motivation


Ord (1967b) suggested a simple graphical representation of discrete probability
distributions¹ in the two-dimensional plane; however, his idea can be applied
directly to continuous distributions as well. The coordinates of a distribution
in the graph are given as proportions of its first three moments, namely, the
mean $\mu$, the variance $\mu_2$ and the third central moment (i.e., the skewness) $\mu_3$.
In general, all distributions can be depicted for which the first three moments
exist and the first two of them are non-zero. Keeping the notation from Ord
(1967b), the x- and y-coordinates will be denoted by I and S, respectively,
with $I = \mu_2/\mu$ and $S = \mu_3/\mu_2$. If all possible parameter values of a particular
distribution are considered, one obtains an area (or a curve, a line, a point)
characteristic for the distribution (we note that areas belonging to different
distributions can overlap). Some of them can be seen in Figure 1, which is
taken from Ord (1967b).


If theoretical moments are replaced with empirical ones, the Ord’s graph 
can also be used to display data. It can serve as a preliminary, intuitive decision 
criterion whether the data can be modelled by a particular distribution. If the 
point representing the data lies within the area of the distribution, or not 
too far away from it, a (relatively) good fit between the data and the model 
can be expected. Among other things, the graph also makes it possible to
classify or cluster data: points representing related data are expected to
lie close to each other.
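
As a minimal sketch of how a data set is placed on the graph, the following Python function computes the empirical mean, variance and third central moment of a frequency table and returns the pair (I, S). The function name `ord_coordinates` and the toy counts are illustrative assumptions, not taken from Ord (1967b).

```python
from typing import Sequence, Tuple


def ord_coordinates(values: Sequence[float],
                    freqs: Sequence[float]) -> Tuple[float, float]:
    """Return Ord's (I, S) computed from the empirical moments of a frequency table."""
    n = sum(freqs)
    if n <= 0:
        raise ValueError("total frequency must be positive")
    # Empirical mean, variance and third central moment.
    mean = sum(v * f for v, f in zip(values, freqs)) / n
    m2 = sum(f * (v - mean) ** 2 for v, f in zip(values, freqs)) / n
    m3 = sum(f * (v - mean) ** 3 for v, f in zip(values, freqs)) / n
    if mean == 0 or m2 == 0:
        raise ValueError("the first two moments must be non-zero")
    return m2 / mean, m3 / m2


# Hypothetical symmetric toy data on the values 1, 2, 3.
i_coord, s_coord = ord_coordinates([1, 2, 3], [1, 2, 1])
print(i_coord, s_coord)
```

For symmetric data such as the toy example the third central moment vanishes, so the point lies on the horizontal axis of the graph.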


The Ord’s graph has been applied in several areas of research, e.g., in biology
(Schneider and Duffy, 1985), transport network modelling (Taylor, 1976; Beguin
and Thomas, 1997), linguistics (Stadlober and Djuzelic, 2005; Grzybek and Rusko,
2009), and musicology (Martinakova et al., 2009).


However, the Ord’s graph is not applicable to data of categorical character
(for an overview of graphical methods suitable for such data see Blasius and
Greenacre, 1998, and Friendly, 2000). Especially in the case of nominal data,
i.e., if there is no natural ordering of categories (see, e.g., Agresti, 2013,
p. 3), using the graph would require an assignment of integers to the
categories. Such an assignment can only be arbitrary, and the arbitrariness
leads almost necessarily to ambiguities.
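
The ambiguity can be made concrete with a small sketch. The three category labels, their counts and the two integer codings below are hypothetical; the point is only that the same frequencies yield different (I, S) coordinates once the arbitrary coding changes.

```python
from typing import Dict, Tuple


def ord_coordinates(counts: Dict[str, int],
                    codes: Dict[str, int]) -> Tuple[float, float]:
    """Ord's (I, S) for nominal counts under a given integer coding of the categories."""
    n = sum(counts.values())
    mean = sum(codes[c] * f for c, f in counts.items()) / n
    m2 = sum(f * (codes[c] - mean) ** 2 for c, f in counts.items()) / n
    m3 = sum(f * (codes[c] - mean) ** 3 for c, f in counts.items()) / n
    return m2 / mean, m3 / m2


# Hypothetical nominal data: three categories without a natural order.
counts = {"a": 50, "b": 30, "c": 20}

# Two equally arbitrary assignments of integers to the same categories.
point_a = ord_coordinates(counts, {"a": 1, "b": 2, "c": 3})
point_b = ord_coordinates(counts, {"a": 2, "b": 3, "c": 1})

print(point_a)
print(point_b)  # a different point, although the frequencies are identical
```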

We will apply both the original Ord’s graph and its new modification (see
Section 3) to grapheme frequencies in Slavic languages (see Section 2 for data
description). Grapheme orderings, as they are established in alphabets (or 


1 In order to avoid confusion, we note that the same author also developed another
graphical method for discrete distributions, which was published in the same year, see
Ord (1967a), and also Friendly (2000).

































































Fig. 1 Graphical representation of discrete distributions from Ord (1967b).


other writing systems) specific for particular languages, are results of traditions
and/or conventions which are not linguistically substantiated in the vast
majority of languages. Slavic languages are not exceptional in this respect.
Moreover, two further facts mar any attempt to achieve a grapheme ordering
common to all Slavic languages. They not only have different grapheme
inventories, but languages from this family also use writing systems based
on two different scripts, namely, Latin and Cyrillic. These two scripts (and
their modifications) follow different traditions of grapheme orderings, e.g., the
grapheme z appears towards the end of Slavic adaptations of the Latin alphabet
(Comrie, 1996b), but its Cyrillic counterpart з is positioned around the
eighth place (out of roughly 30, depending on the language, see Section 2) in
alphabets based on the Cyrillic script (Comrie, 1996a).


One reasonable possibility left is to work with ranked frequencies, where
the most frequent grapheme is given rank 1, the second most frequent
rank 2, etc. The problem of ambiguities mentioned above is thus solved. This
direction of research has enjoyed increased popularity in recent years. There are
several studies available, mainly for Slavic languages (see Grzybek et al., 2009,
and references therein), but also for German (Grzybek, 2007), Irish and Manx
(Wilson, 2013), and some languages from West Africa (Rovenchak and Vydrin,
2010). The negative hypergeometric distribution (see, e.g., Wimmer and Altmann,
1999, pp. 465-468) is tentatively considered a general mathematical model.


However, its parameters, and hence also its moments, seem to depend on the
inventory size (i.e., on the number of graphemes used in particular languages,
IS henceforward; the determination of the grapheme inventory size is a complex
linguistic issue, some details specific for Slavic languages can be found in
Kelih, 2013). The dependence within the Slavic language family was demonstrated
by Grzybek and Kelih (2005) and Grzybek et al. (2005). Consequently,
the Ord’s graph, which exploits the moments, will reflect not only a measure
of relatedness among Slavic languages; it will also be influenced by
their inventory sizes. We will show in Section 2 that the graph constructed
from grapheme rank-frequency distributions does not lead to linguistically
explainable results.

Therefore, in Section 3 we suggest a modification of the Ord’s graph, in
which moments are replaced with so-called indices of qualitative variation (see
Wilcox, 1973). The new graph reveals a meaningful classification of Slavic
languages.


2 Data description 


The grapheme frequencies which will be analyzed were obtained from the Russian
social realist novel Kak zakaljalas’ stal’ (How the Steel Was Tempered) and
its translations into ten other Slavic languages. The book was written by Nikolai
Ostrovsky in the 1930s. It enjoyed the status of recommended reading; therefore it
was translated into the languages spoken in the countries of the socialist bloc
within a relatively short time period. The linguistic corpus consisting of the
Russian (RUS henceforward, IS = 33) original and its translations into
Belorusian, Bulgarian (BUL, IS = 30), Croatian (CRO, IS = 30), Czech (CZE,
IS = 42), Macedonian (MAC, IS = 31), Polish (POL, IS = 32), Serbian (SRB,
IS = 30), Slovene (SLO, IS = 25), Slovak (SVK, IS = 43), Ukrainian (UKR,
IS = 34), and Upper Sorbian (UPS, IS = 37) was described by Kelih (2009b).


Belorusian was omitted from our considerations, as its orthography differs
substantially from other Slavic languages. Belorusian has an explicit, phonetically
determined orthographic system, i.e., letters are used for coding phones
and not phonemes (and partly morphophonemes) as, e.g., in the case of Russian
and Ukrainian. This different coding approach has, among others, the
effect of an extreme overexploitation of particular graphemes (for details see
Kelih, 2009a). Rank-frequency distributions of graphemes from eleven Slavic
languages can be found in Table 1 (the languages are ordered decreasingly
according to their grapheme inventory sizes); they are displayed on the Ord’s
graph in Figure 2 (left).²


Table 1: Grapheme rank-frequency distributions in Slavic languages.

Rank    SVK    CZE    UPS    UKR    RUS    POL    MAC    BUL    CRO    SRB    SLO
   1  26490  20618  29440  25494  28305  26718  40232  36841  32444  32507  30849
   2  23869  20371  27097  22419  23509  25264  30122  24724  25820  25823  29708
   3  20564  19595  24691  17958  21205  22229  28420  23098  24952  24709  26129
   4  15166  15223  17213  16868  17140  20509  20985  21644  24320  23473  25886
   5  13204  14183  16201  15985  16143  18622  20793  19535  13457  13332  17175
   6  12842  12586  14719  14123  14868  14275  17111  17133  13215  13168  15921
   7  12233  12174  13527  12146  13980  13344  13634  13867  12958  12888  15045
   8  12137  11365  12224  11835  13265  12876  13152  13394  12759  12728  14144
   9  11959  11312  11500  11566  13103  12627  11613  12224  11581  11453  14139
  10  11548  10193  10995  10521  12693  12120  10640  11329  10237   9949  12402
  11  10010   9639  10640  10339  10004  11170  10591   9197   9958   9929  11569
  12   8981   9147  10113   9926   8396  10120   7815   8542   9885   9661  11412
  13   8569   8477   9647   9811   8147   9637   7753   7950   9741   9163  10029
  14   8293   8320   8425   8871   7834   9499   7123   7339   9139   8296   9167
  15   7389   8252   7725   8327   7733   8933   6327   6197   8384   7958   8753
  16   6729   6301   7697   7542   5479   8623   6127   5633   7779   7794   6441
  17   6051   5552   7238   6693   5191   8510   5440   5309   5047   5015   5515
  18   5496   5338   7182   5640   5045   6564   5219   4770   4688   4732   5336
  19   4282   5229   5625   4759   5026   5964   5191   4554   3808   3889   4755
  20   4270   5219   5540   4618   4957   5354   4360   4344   3768   3797   4429
  21   4267   4719   5341   4215   4498   4613   3203   4035   3075   3004   3054
  22   3697   4207   5201   3977   3679   4387   2015   3220   2258   2239   2923
  23   3352   4103   4135   3952   3288   4361   1798   2681   2225   2194   1967
  24   2772   3290   4024   3038   2859   3714   1540   2197   1810   1832   1893
  25   2498   3169   3579   2963   2667   3199    803   1956   1769   1703    230
  26   2424   2932   3412   2486   2506   2548    563   1936   1709   1592
  27   2358   2650   2888   2101   1556   2052    365   1464   1665   1512
  28   1867   2583   2867   1937   1098   1851    303    362    637    649
  29   1722   2460   2813   1430    971   1220    171    336    241    278
  30   1456   2098   2668   1340    539    416     66    320     55     77
  31   1276   2032   2241    878    312    406     35
  32    642    892    607    282     59    254
  33    601    541    505    242      0
  34    581    253    276      1


2 Two from among currently spoken standard Slavic languages were not included:
Belorusian, as was explained, was omitted because of its peculiar orthography; and Lower
Sorbian, because no suitable texts (i.e., long enough and comparable with analogous texts
in other Slavic languages) could be found (the language has about 7000 speakers only). We
do not intend to discuss here the status of one language/different languages/dialects of, e.g.,
Ukrainian/Rusyn, Bosnian/Croatian/Montenegrin/Serbian, Polish/Cassubian, etc.
Fig. 2 Original Ord’s graph applied to grapheme frequencies (left) and inventory sizes in
Slavic languages (right), with cluster analysis applied to graph coordinates. 


Table 1: continued

Rank    SVK    CZE    UPS
  35    366    213      0
  36    320    188      0
  37    186    182      0
  38    100    169
  39     94     86
  40     30     12
  41     10      7
  42      6      0
  43      0

Since the beginning of modern Slavic linguistics and typology in the
mid-19th century, the classification of Slavic languages has been discussed many
times. By now, a simple typology based on the geographical location of the
Slavic standard languages is more or less accepted; it divides the languages
into three groups: East Slavic (Belorusian, Russian, Ukrainian), West Slavic
(Czech, Polish, Slovak, Upper and Lower Sorbian), and South Slavic (Bulgarian,
Croatian, Macedonian, Serbian, Slovene).

Cluster analysis was applied to the I- and S-coordinates from the Ord’s
graph, with three clusters required (indicated by ellipses in Figure 2, left).
Clustering was performed in the statistical software R. Two methods were used,
namely, k-means and k-medoids. In Figure 2 (left), they yield the same clusters
regardless of the choice of the algorithm for the k-means method
(Hartigan-Wong, Lloyd, MacQueen) and of the metric for the k-medoids method
(Euclidean, Manhattan). Figure 2 (right) presents clusters resulting from the k-means
method; the k-medoids method gives clusters almost identical to the ones from
Figure 3 (the only difference is that UPS migrates into the cluster containing
CZE and SVK). For a (relatively) short overview of cluster analysis see,
e.g., Izenman (2008).
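
To illustrate the kind of procedure involved, here is a minimal pure-Python sketch of the Lloyd variant of k-means. It is not the R code used for the figures; the function name `lloyd_kmeans` and the choice of initial centres are assumptions made for this example. It is applied to the inventory sizes listed above, treated as one-dimensional points.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points: Sequence[Point],
                 centers: Sequence[Point],
                 max_iter: int = 100) -> List[int]:
    """Assign each point to a cluster by alternating assignment and centre updates."""
    centers = [tuple(c) for c in centers]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centre for every point.
        new_labels = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centre
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


# Grapheme inventory sizes (IS) as given in the data description.
inventory = {"SVK": 43, "CZE": 42, "UPS": 37, "UKR": 34, "RUS": 33, "POL": 32,
             "MAC": 31, "BUL": 30, "CRO": 30, "SRB": 30, "SLO": 25}
points = [(float(size),) for size in inventory.values()]

# Initial centres (smallest, a middle and largest size) are an arbitrary choice.
labels = lloyd_kmeans(points, centers=[(25.0,), (32.0,), (43.0,)])
print(dict(zip(inventory, labels)))
```

The outcome of k-means depends on the initial centres and on the algorithm variant, so this sketch need not reproduce the clusters reported for R.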

The results obtained are not linguistically meaningful (e.g., East Slavic
languages form one group with most of South Slavic ones; on the other hand,
Slovene is a single outlier, which is not explainable, since the historical
development of its writing system is parallel with the other Slavic languages, etc.).
The only clue hinting towards a linguistic explanation is the grapheme inventory
size of the languages analyzed, as the clusters coincide with the ones based
on the sizes of grapheme inventories (Figure 2, right). Grapheme inventories,
however, reflect history, traditions, conventions, etc. of a language (see also
Section 1) more than linguistic laws and relations among languages; furthermore,
they are extremely conservative and almost resistant to changes (which,
if they occur, more often than not follow sudden historical/political changes, and
not slow, continuous changes of languages).

Given that moments of the grapheme rank-frequency distributions depend,
at least for Slavic languages, on the inventory sizes (Grzybek and Kelih, 2005;
Grzybek et al., 2005), the coincidence of clusters in Figure 2 is not surprising.


3 Modified Ord’s graph 


Consider N data items divided into K categories and denote by $f_i$ the frequency
of the i-th category. Wilcox (1973) discussed in his paper several measures of
variation applicable (also) to nominal data, among them the variance analogue

$$VA = 1 - \frac{\sum_{i=1}^{K}\left(f_i - \frac{N}{K}\right)^2}{\frac{N^2 (K-1)}{K}}, \tag{1}$$

the standard deviation analogue

$$SDA = 1 - \sqrt{\frac{\sum_{i=1}^{K}\left(f_i - \frac{N}{K}\right)^2}{\frac{N^2 (K-1)}{K}}}, \tag{2}$$

and the relativized entropy

$$RE = \frac{-\sum_{i=1}^{K} p_i \log p_i}{\log K}, \tag{3}$$

where $p_i = f_i/N$ is the relative frequency of the i-th category and log denotes
the natural logarithm.
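
The three indices can be computed directly from a list of category frequencies. The following Python sketch follows the definitions of the variance analogue, the standard deviation analogue and the relativized entropy; the function names are illustrative, and the convention that empty categories contribute nothing to the entropy sum is an assumption of the sketch.

```python
import math
from typing import Sequence


def _scaled_squared_deviation(freqs: Sequence[float]) -> float:
    """Sum of squared deviations from N/K, divided by N^2 (K - 1) / K."""
    n = sum(freqs)
    k = len(freqs)
    if k < 2 or n <= 0:
        raise ValueError("need at least two categories and a positive total")
    return sum((f - n / k) ** 2 for f in freqs) / (n ** 2 * (k - 1) / k)


def variance_analogue(freqs: Sequence[float]) -> float:
    """VA: one minus the scaled squared deviation."""
    return 1.0 - _scaled_squared_deviation(freqs)


def sd_analogue(freqs: Sequence[float]) -> float:
    """SDA: one minus the square root of the scaled squared deviation."""
    return 1.0 - math.sqrt(_scaled_squared_deviation(freqs))


def relativized_entropy(freqs: Sequence[float]) -> float:
    """RE: entropy of the relative frequencies divided by log K."""
    n = sum(freqs)
    k = len(freqs)
    # Assumption: categories with zero frequency are skipped (0 * log 0 := 0).
    entropy = -sum((f / n) * math.log(f / n) for f in freqs if f > 0)
    return entropy / math.log(k)


# Uniform frequencies give the maximum, a single occupied category the minimum.
print(variance_analogue([5, 5, 5, 5]), sd_analogue([5, 5, 5, 5]),
      relativized_entropy([5, 5, 5, 5]))
print(variance_analogue([10, 0, 0]), sd_analogue([10, 0, 0]),
      relativized_entropy([10, 0, 0]))
```

Because every function only sums over the frequencies, reordering the categories leaves all three values unchanged (up to floating-point rounding).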

In Wilcox (1973), these measures are called indices of qualitative variation.
They have at least two properties which distinguish them from the usual
measures of variation (like the variance, the standard deviation, etc.). First,
they are invariant with respect to the ordering of categories, i.e., they depend
solely on frequencies. Second, all of them are normalized, with possible values
from the interval [0,1] (for all of them, value 0 is attained if all objects are
in one category and other categories are empty; value 1 corresponds to the
uniform distribution, with all categories having the same frequencies). Thus,
if one considers grapheme frequencies in Slavic languages, indices of qualitative
variation can be a response to ambiguities related to the two traditions
of grapheme orderings. They also eliminate influences of different inventory
sizes.

Given these advantages, we applied the indices (1)-(3) to modify the Ord’s
graph. The modified coordinates are defined as

$$I_m = SDA/VA \tag{4}$$

and

$$S_m = RE/SDA. \tag{5}$$

It is easy to see that $I_m$ could be simplified to the form

$$I_m = \frac{1}{1 + \sqrt{\dfrac{\sum_{i=1}^{K}\left(f_i - \frac{N}{K}\right)^2}{\frac{N^2 (K-1)}{K}}}}.$$

However, for two reasons we prefer to keep the form (4) here. First, it
highlights an analogy with the original Ord’s graph (other measures of qualitative
variation can be more useful for analyses of other types of data, see
Section 4). Second, specifically for linguistic data, the form (4) can be
simpler to interpret. Its denominator is, in fact, the normalized repeat rate

$$RR_n = \frac{K}{K-1}\left(1 - \frac{\sum_{i=1}^{K} f_i^2}{N^2}\right),$$

see Gibbs and Poston (1975), which is one of the standard characteristics in
linguistics.
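
Both claims of this passage, the simplified form of the first coordinate and the identification of the variance analogue with the normalized repeat rate, follow from short algebra. The sketch below introduces the shorthand $x$ (not part of the original notation) for the fraction shared by the variance analogue and the standard deviation analogue, and assumes $x \neq 1$, i.e., the data are not concentrated in a single category.

```latex
\begin{align*}
x &= \frac{\sum_{i=1}^{K}\left(f_i - \frac{N}{K}\right)^2}{\frac{N^2 (K-1)}{K}},
  \qquad VA = 1 - x, \qquad SDA = 1 - \sqrt{x}, \\
I_m &= \frac{SDA}{VA}
     = \frac{1 - \sqrt{x}}{(1 - \sqrt{x})(1 + \sqrt{x})}
     = \frac{1}{1 + \sqrt{x}}, \\
\sum_{i=1}^{K}\left(f_i - \frac{N}{K}\right)^2
    &= \sum_{i=1}^{K} f_i^2 - \frac{N^2}{K}
    \quad\text{(since } \textstyle\sum_{i=1}^{K} f_i = N\text{)}, \\
VA &= 1 - \frac{K \sum_{i=1}^{K} f_i^2 - N^2}{N^2 (K-1)}
    = \frac{K}{K-1}\left(1 - \frac{\sum_{i=1}^{K} f_i^2}{N^2}\right) = RR_n .
\end{align*}
```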

In Figure 3 the new graph can be seen, applied, again, to grapheme frequencies
in Slavic languages (we emphasize that the order of graphemes within
a language is irrelevant in this case). Clusters created from its coordinates
$I_m$ and $S_m$ (ellipses in Figure 3) present a pattern quite different from
Figure 2. The proposed classification reveals interesting findings on the typology
of Slavic languages (the resulting clusters are the same, again, regardless of
the method, algorithm or metric used, see Section 2).
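
As a hedged illustration of how one language is placed on the modified graph, the following Python sketch computes the two modified coordinates for the Slovene (SLO) column of Table 1. The function name `modified_ord_coordinates` is an assumption of this example; the numerical result is printed rather than quoted, since no coordinate values are reported in the text.

```python
import math
from typing import Sequence, Tuple


def modified_ord_coordinates(freqs: Sequence[float]) -> Tuple[float, float]:
    """Return (I_m, S_m) = (SDA / VA, RE / SDA) for a list of category frequencies."""
    n = sum(freqs)
    k = len(freqs)
    x = sum((f - n / k) ** 2 for f in freqs) / (n ** 2 * (k - 1) / k)
    va = 1.0 - x                    # variance analogue
    sda = 1.0 - math.sqrt(x)        # standard deviation analogue
    # Relativized entropy; empty categories are skipped (0 * log 0 := 0).
    re = -sum((f / n) * math.log(f / n) for f in freqs if f > 0) / math.log(k)
    if va == 0 or sda == 0:
        raise ValueError("coordinates are undefined if all items fall in one category")
    return sda / va, re / sda


# Slovene (SLO) grapheme frequencies from Table 1, ranks 1 to 25.
slo = [30849, 29708, 26129, 25886, 17175, 15921, 15045, 14144, 14139,
       12402, 11569, 11412, 10029, 9167, 8753, 6441, 5515, 5336, 4755,
       4429, 3054, 2923, 1967, 1893, 230]

i_m, s_m = modified_ord_coordinates(slo)
print(i_m, s_m)

# The order of graphemes is irrelevant: reversing the list gives the same point.
print(modified_ord_coordinates(list(reversed(slo))))
```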

First of all, there is a group of South Slavic languages, which perfectly
fits with their geographical location. Bulgarian, Croatian, Macedonian, Serbian,
and Slovene form one homogeneous group. The orthographic systems of
these languages are well organized with respect to the economy of coding of
some specific prosodic features (like the pitch accent in Croatian, Serbian, and
Slovene) and to the marking of palatalized consonants in Bulgarian (marked
with a specific vocalic grapheme). Macedonian is one of the youngest standard
languages (codified in 1945) and its orthography is largely based on the same
principles as Serbian (one letter for one sound).


Fig. 3 Modified Ord’s graph applied to grapheme frequencies, with cluster analysis applied 
to graph coordinates. 


The second group can be called the basic West Slavic languages; it includes
Czech and Slovak. The two languages are typologically quite similar in general,
including their orthographic and phonemic systems, and thus their location in
one group is justified.

In Figure 3, Russian, Ukrainian, Polish and Upper Sorbian form one group.
If one compares it with the traditional geographical classification, this North
Slavic group seems to be a mixture of East Slavic (Russian and Ukrainian) and
West Slavic languages (Polish and Upper Sorbian). However, if orthographic
and phonemic criteria are taken into account, these languages share some common
features, namely, they are characterized by a systematic palatalization
correlation in the consonantal system (i.e., consonants tend to have both “hard”
and “soft” versions). Indeed, these characteristics play a very important role
in Russian and Ukrainian, whereas a regression of palatalization was reported
for Polish and especially for Upper Sorbian.


The groups resulting from the cluster analysis of the modified Ord’s graph
coordinates differ slightly from the traditional, area-based typology of Slavic
languages, but they suggest another, linguistically justifiable classification. It
corresponds to the approach of Kolomiec et al. (1986), where a group of North
Slavic languages (Russian, Ukrainian, Polish) is mentioned; they are characterized
by a high number of consonants in their inventories, whereas South Slavic
languages mainly enlarged their vowel inventory (for a detailed discussion of
vocalic and consonantal Slavic languages see Sawicka, 1991).
4 Conclusion


Our modification of the Ord’s graph brings linguistically motivated and
interpretable results. Cluster analysis applied to the coordinates of the new graph
reveals groups of languages which share some common features, as far as
orthography and phonology are concerned. Thus, the application of the modified
Ord’s graph to grapheme frequencies can be seen as a contribution towards
the typology of Slavic languages. Compared with their traditional, purely
geographical classification, the new approach has the advantage of being based
on empirically observed data.

Admittedly, the definition of the modified graph coordinates (4) and (5)
used in this paper - i.e., the choice of indices (1)-(3) - is heuristic only. Apart
from the fact that they yield linguistically relevant results in this case, there is
no other reason why they should be preferred. It can be expected that other
indices of qualitative variation (see, e.g., Wilcox, 1973; Gibbs and Poston, 1975)
can be more reasonable for categorical data arising from other branches of
science.

Regardless of the choice of the indices, the method is computationally very
simple; the results it yields are also easy to understand, as they are displayed in
the two-dimensional plane. In addition, it represents categorical data by two
real-valued coordinates, thus enabling applications of statistical classification
or clustering methods.